23 Feb 2017
Do other time series analysis
Other types of analysis also involve processing timestamp data.
library(padr)
library(dplyr)
padr::emergency %>% head
## # A tibble: 6 × 6
##        lat       lng   zip                   title          time_stamp
##      <dbl>     <dbl> <int>                   <chr>              <dttm>
## 1 40.29788 -75.58129 19525  EMS: BACK PAINS/INJURY 2015-12-10 17:40:00
## 2 40.25806 -75.26468 19446 EMS: DIABETIC EMERGENCY 2015-12-10 17:40:00
## 3 40.12118 -75.35198 19401     Fire: GAS-ODOR/LEAK 2015-12-10 17:40:00
## 4 40.11615 -75.34351 19401  EMS: CARDIAC EMERGENCY 2015-12-10 17:40:01
## 5 40.25149 -75.60335    NA          EMS: DIZZINESS 2015-12-10 17:40:01
## 6 40.25347 -75.28324 19446        EMS: HEAD INJURY 2015-12-10 17:40:01
## # ... with 1 more variables: twp <chr>
Every row is a single observation, typically at the second level. You usually want to do the analysis at a (much) higher level.
padr offers: thicken, used in conjunction with a data manipulation package, like dplyr.
emergency %>% thicken(interval = "month") %>% count(time_stamp_month) %>% head
## # A tibble: 6 × 2
##   time_stamp_month     n
##             <date> <int>
## 1       2015-12-01  7969
## 2       2016-01-01 13205
## 3       2016-02-01 11467
## 4       2016-03-01 11101
## 5       2016-04-01 11326
## 6       2016-05-01 11423
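To make the idea of thickening concrete, here is a rough base R sketch of the same kind of aggregation. It is not how padr implements thicken; the timestamps are made up for illustration and do not come from the emergency data set.

```r
# Hypothetical timestamps (assumption: not taken from padr::emergency)
ts <- as.POSIXct(c("2015-12-10 17:40:00",
                   "2015-12-31 23:59:59",
                   "2016-01-01 00:00:00"), tz = "UTC")

# Round every timestamp down to the first day of its month
ts_month <- as.Date(format(ts, "%Y-%m-01"))

# Count the observations per month
month_counts <- table(ts_month)
month_counts
```

The appeal of thicken is that it does this rounding for any of the supported intervals and adds the result as a new column, so it slots into a dplyr pipeline.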
When there is no observation, there is no record.
padr offers: pad
data.frame(dt = as.Date(c("2017-02-23", "2017-02-26")),
           val = c(2, 4)) %>%
  pad
##           dt val
## 1 2017-02-23   2
## 2 2017-02-24  NA
## 3 2017-02-25  NA
## 4 2017-02-26   4
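For comparison, the same result can be approximated in base R by joining the data onto a complete date sequence. This is only a sketch of the concept under the assumption of a daily interval; it says nothing about how pad works internally.

```r
df <- data.frame(dt  = as.Date(c("2017-02-23", "2017-02-26")),
                 val = c(2, 4))

# Every day between the first and last observation
full <- data.frame(dt = seq(min(df$dt), max(df$dt), by = "day"))

# Left join: days without an observation get NA for val
padded <- merge(full, df, by = "dt", all.x = TRUE)
padded
```

With pad you do not have to spell out the interval or the sequence yourself, which is the point of the function.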
Think of time data as having a heartbeat: it produces data at a certain interval.
padr currently uses eight intervals: year, quarter, month, week, day, hour, minute, and second.
get_interval(emergency$time_stamp)
## [1] "sec"
The interval is the highest of the eight that can explain all the instances observed in the data.
dt <- as.Date(c("2017-02-23", "2017-02-26"))
all(dt %in% seq(dt %>% min, dt %>% max, by = 'day'))
## [1] TRUE
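The same check shows why a higher interval does not qualify for these two dates. A small sketch using the same base R approach: the weekly sequence starting at the first date does not contain the second one.

```r
dt <- as.Date(c("2017-02-23", "2017-02-26"))

# 'day' explains both observations
fits_day  <- all(dt %in% seq(min(dt), max(dt), by = "day"))

# 'week' does not: 2017-02-26 is not a whole number of weeks after 2017-02-23
fits_week <- all(dt %in% seq(min(dt), max(dt), by = "week"))

c(day = fits_day, week = fits_week)
```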
The thicken function takes in a data frame, then it does:
thicken parameters:
x
interval = c("level_up", "year", "quarter", "month",
"week", "day", "hour", "min")
colname = NULL
rounding = c("down", "up")
by = NULL
start_val = NULL
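As an illustration of what the rounding options could mean, here is a base R sketch that rounds a single timestamp down and up to the hour. This is an assumption about the general idea, not padr's implementation, and padr's handling of values that fall exactly on a boundary may differ.

```r
ts   <- as.POSIXct("2015-12-10 17:40:01", tz = "UTC")
secs <- as.numeric(ts)  # seconds since 1970-01-01 UTC

# Round down: start of the hour the timestamp falls in
down <- as.POSIXct(floor(secs / 3600) * 3600,
                   origin = "1970-01-01", tz = "UTC")

# Round up: start of the next hour
up <- as.POSIXct(ceiling(secs / 3600) * 3600,
                 origin = "1970-01-01", tz = "UTC")

format(c(down, up), "%H:%M:%S")
```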
The pad function takes in a data frame, then it does:
Inserts a record for each missing time point, with NA values for the other variables.
pad parameters:
x
interval = NULL
start_val = NULL
end_val = NULL
by = NULL
group = NULL
Last week v0.2.0 came out (along with patch release v0.2.1 :) ), which introduced group padding.
emergency %>%
thicken('month', col = "m") %>%
count(m, title) %>%
pad(group = "title",
start_val = as.Date("2015-12-01"),
end_val = as.Date("2016-10-01"))
## # A tibble: 1,287 × 3
##             m                title     n
## *      <date>                <chr> <int>
## 1  2015-12-01 EMS: ABDOMINAL PAINS   128
## 2  2016-01-01 EMS: ABDOMINAL PAINS   186
## 3  2016-02-01 EMS: ABDOMINAL PAINS   161
## 4  2016-03-01 EMS: ABDOMINAL PAINS   184
## 5  2016-04-01 EMS: ABDOMINAL PAINS   185
## 6  2016-05-01 EMS: ABDOMINAL PAINS   162
## 7  2016-06-01 EMS: ABDOMINAL PAINS   158
## 8  2016-07-01 EMS: ABDOMINAL PAINS   143
## 9  2016-08-01 EMS: ABDOMINAL PAINS   176
## 10 2016-09-01 EMS: ABDOMINAL PAINS   174
## # ... with 1,277 more rows
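Conceptually, group padding makes sure every group has a record for every time point in the range. A base R sketch of that idea, using a small made-up data frame (the titles "A" and "B" and the counts are hypothetical), could look like this. It is not padr's implementation.

```r
# Hypothetical monthly counts per group
counts <- data.frame(m     = as.Date(c("2016-01-01", "2016-03-01", "2016-02-01")),
                     title = c("A", "A", "B"),
                     n     = c(3, 5, 2),
                     stringsAsFactors = FALSE)

months <- seq(as.Date("2016-01-01"), as.Date("2016-03-01"), by = "month")
titles <- unique(counts$title)

# One row for every combination of month and group
grid <- data.frame(m     = rep(months, times = length(titles)),
                   title = rep(titles, each = length(months)),
                   stringsAsFactors = FALSE)

# Left join: combinations without a count get NA for n
padded <- merge(grid, counts, by = c("m", "title"), all.x = TRUE)
padded <- padded[order(padded$title, padded$m), ]
padded
```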
After padding, the inserted records contain missing values for the other variables.
padded_df <-
data.frame(dt = as.Date(c("2017-02-23", "2017-02-25", "2017-02-27")),
val = c(2, 4, 2)) %>% pad
padded_df
##           dt val
## 1 2017-02-23   2
## 2 2017-02-24  NA
## 3 2017-02-25   4
## 4 2017-02-26  NA
## 5 2017-02-27   2
Depending on the nature of the data you might want to:
Carry the last value forward
padded_df %>% tidyr::fill(val)
##           dt val
## 1 2017-02-23   2
## 2 2017-02-24   2
## 3 2017-02-25   4
## 4 2017-02-26   4
## 5 2017-02-27   2
Fill all the missing values with the same value
padded_df %>% fill_by_value(val, value = 42)
##           dt val
## 1 2017-02-23   2
## 2 2017-02-24  42
## 3 2017-02-25   4
## 4 2017-02-26  42
## 5 2017-02-27   2
Fill all the missing values with a function of the nonmissing values
padded_df %>% fill_by_function(val, fun = mean)
##           dt      val
## 1 2017-02-23 2.000000
## 2 2017-02-24 2.666667
## 3 2017-02-25 4.000000
## 4 2017-02-26 2.666667
## 5 2017-02-27 2.000000
Fill all the missing values with the most prevalent of the nonmissing values
padded_df %>% fill_by_prevalent(val)
##           dt val
## 1 2017-02-23   2
## 2 2017-02-24   2
## 3 2017-02-25   4
## 4 2017-02-26   2
## 5 2017-02-27   2
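For readers who want to see what these fill strategies amount to, here is a base R sketch of all four on the same val column. These are illustrative equivalents, not the code behind tidyr::fill or padr's fill_by_ functions, and the last-value-carried-forward line assumes the first value is not missing.

```r
val <- c(2, NA, 4, NA, 2)
missing <- is.na(val)

# Carry the last value forward (assumes val[1] is not NA)
locf <- val[!missing][cumsum(!missing)]

# Fill with a constant
constant <- replace(val, missing, 42)

# Fill with a function of the nonmissing values, here the mean
by_mean <- replace(val, missing, mean(val, na.rm = TRUE))

# Fill with the most prevalent nonmissing value
most_common <- as.numeric(names(which.max(table(val))))
prevalent <- replace(val, missing, most_common)

data.frame(locf, constant, by_mean, prevalent)
```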
library(ggplot2)
animal_bites_plot <- emergency %>%
  filter(title == 'EMS: ANIMAL BITE') %>%
  thicken(interval = 'day', col = 'ts_day') %>%
  count(ts_day) %>%
  pad %>%
  fill_by_value(n) %>%
  ggplot(aes(ts_day, n)) +
  geom_point() +
  geom_line() +
  geom_smooth()
animal_bites_plot
Enable the user to apply a custom span; seq is very flexible.
seq(as.Date('2017-02-23'), as.Date('2017-03-03'), by = "3 days")
## [1] "2017-02-23" "2017-02-26" "2017-03-01"
Still need to figure out how to fit it in neatly with the interval paradigm.
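Two more base R examples of the flexibility of seq for dates, as a sketch of the kinds of spans a user might want. The specific dates are only illustrative.

```r
start <- as.Date("2017-02-23")

# A fixed number of points instead of an end date
by_count <- seq(start, by = "3 days", length.out = 4)

# A span of multiple weeks up to an end date
by_weeks <- seq(start, as.Date("2017-03-23"), by = "2 weeks")

by_count
by_weeks
```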
There are two vignettes, a general introduction and more details on the implementation.
vignette("padr")
vignette("padr_implementation")
I blog about changes in padr on: thats-so-random.com
And the package is maintained on: github.com/EdwinTh/padr